2. A Short Tour of the Predictive Modeling Process

This chapter introduces the broad concepts of model building, i.e. building candidate models and then selecting the optimal model.

2.1 Case Study: Predicting Fuel Economy

Check that the data files exist:


In [1]:
!ls -l ../datasets/FuelEconomy/


total 280
-rw-r--r--  1 leigong  staff  106388 Nov 18 16:08 cars2010.csv
-rw-r--r--  1 leigong  staff   23219 Nov 18 16:08 cars2011.csv
-rw-r--r--  1 leigong  staff    8969 Nov 18 16:08 cars2012.csv

In this study, we are only interested in the data from 2010 and 2011. We use the DataFrame type from pandas to store the files; it is very similar to R's data frame.


In [2]:
from __future__ import division
import numpy as np
import pandas as pd

cars10 = pd.read_csv("../datasets/FuelEconomy/cars2010.csv")
cars11 = pd.read_csv("../datasets/FuelEconomy/cars2011.csv")

In [3]:
cars10.head(5)


Out[3]:
Unnamed: 0 EngDispl NumCyl Transmission FE AirAspirationMethod NumGears TransLockup TransCreeperGear DriveDesc IntakeValvePerCyl ExhaustValvesPerCyl CarlineClassDesc VarValveTiming VarValveLift
0 1088 4.7 8 AM6 28.0198 NaturallyAspirated 6 1 0 TwoWheelDriveRear 2 2 2Seaters 1 0
1 1089 4.7 8 M6 25.6094 NaturallyAspirated 6 1 0 TwoWheelDriveRear 2 2 2Seaters 1 0
2 1090 4.2 8 M6 26.8000 NaturallyAspirated 6 1 0 AllWheelDrive 2 2 2Seaters 1 0
3 1091 4.2 8 AM6 25.0451 NaturallyAspirated 6 1 0 AllWheelDrive 2 2 2Seaters 1 0
4 1092 5.2 10 AM6 24.8000 NaturallyAspirated 6 0 0 AllWheelDrive 2 2 2Seaters 1 0

Check whether there are any missing values (NaN) in this dataset; count() returns the number of non-null entries in each column.


In [4]:
cars10.count()


Out[4]:
Unnamed: 0             1107
EngDispl               1107
NumCyl                 1107
Transmission           1107
FE                     1107
AirAspirationMethod    1107
NumGears               1107
TransLockup            1107
TransCreeperGear       1107
DriveDesc              1107
IntakeValvePerCyl      1107
ExhaustValvesPerCyl    1107
CarlineClassDesc       1107
VarValveTiming         1107
VarValveLift           1107
dtype: int64
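Every column reports 1107 non-null entries, matching the number of rows, so no values are missing. A more direct check uses pandas' isnull (a minimal sketch, not part of the original run):

# True if any entry anywhere in the 2010 data is NaN
print cars10.isnull().values.any()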

In [5]:
print cars10.shape
print cars11.shape


(1107, 15)
(245, 15)

We restrict ourselves to a single predictor, 'EngDispl', and the response, 'FE', in this introductory illustration.


In [6]:
cars10_feature = cars10[['EngDispl']]
cars10_target = cars10[['FE']]
cars11_feature = cars11[['EngDispl']]
cars11_target = cars11[['FE']]

In [7]:
cars10_feature.head(5)


Out[7]:
EngDispl
0 4.7
1 4.7
2 4.2
3 4.2
4 5.2

In [8]:
cars10_target.head(5)


Out[8]:
FE
0 28.0198
1 25.6094
2 26.8000
3 25.0451
4 24.8000

Generally, we want to visualize the dataset first to get a better understanding before doing anything else. Since there is only one predictor, a simple scatter plot will do the trick. The characteristics revealed by the visualization may suggest important and necessary pre-processing steps.


In [9]:
%matplotlib inline
import matplotlib.pyplot as plt

# Some nice default configuration for plots
plt.rcParams['figure.figsize'] = 10, 7.5
plt.rcParams['axes.grid'] = True
plt.gray()



In [10]:
fig, (ax1, ax2) = plt.subplots(1, 2, sharey = True)

ax1.scatter(cars10_feature, cars10_target)
ax1.set_title('2010 Model Year')
ax2.scatter(cars11_feature, cars11_target)
ax2.set_title('2011 Model Year')

fig.text(0.5, 0.04, 'Engine Displacement', ha='center', va='center')
fig.text(0.06, 0.5, 'Fuel Efficiency (MPG)', ha='center', va='center', rotation='vertical')


Out[10]:
[Figure: scatter plots of Fuel Efficiency (MPG) against Engine Displacement for the 2010 and 2011 model years]

Because of the nature of this problem, i.e. predicting the MPG for new car lines, we take the 2010 data as the training set and the 2011 data as the test set.


In [11]:
# Define the evaluation metric: root mean squared error (RMSE)
from sklearn.metrics import mean_squared_error

def rmse(y_actual, y_predicted):
    '''calculate Root Mean Squared Error'''
    return np.sqrt(mean_squared_error(y_actual, y_predicted))
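
Explicitly, for $n$ samples with observed values $y_i$ and predictions $\hat{y}_i$, $$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}.$$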

A good starting point is the simple linear model $$y = \beta_0 + \beta_1x,$$ where $y$ is the Fuel Efficiency (MPG) and $x$ is the Engine Displacement.


In [12]:
# simple linear model
from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(cars10_feature, cars10_target)
print "Least square estimate: intercept = {0}, coefficient ={1}".format(reg.intercept_, reg.coef_[0])


Least squares estimate: intercept = [ 50.56322991], coefficient = [-4.52092928]

In [13]:
X = np.linspace(np.min(cars10_feature)[0], np.max(cars10_feature)[0])[:, np.newaxis]
y = reg.predict(X)
cars10_target_pred = reg.predict(cars10_feature)
y_range = np.linspace(np.min(cars10_target)[0], np.max(cars10_target)[0])[:, np.newaxis]

fig, (ax1, ax2) = plt.subplots(1, 2)

ax1.scatter(cars10_feature, cars10_target)
ax1.plot(X, y, 'r')
ax1.set_title('2010 Model Year')
ax1.set_xlabel('Engine Displacement')
ax1.set_ylabel('Fuel Efficiency (MPG)')

ax2.scatter(cars10_target, cars10_target_pred)
ax2.plot(y_range, y_range, 'r--')
ax2.set_xlabel('Observed')
ax2.set_ylabel('Predicted')


Out[13]:
[Figure: left, the 2010 training data with the fitted regression line; right, observed vs. predicted MPG]

The left-hand panel shows the training set data with the linear model fit defined by the estimated slope and intercept. The right-hand panel plots the observed MPG against the predicted MPG. These plots demonstrate that this model misses some of the patterns in the data, such as under-predicting fuel efficiency when the displacement is less than 2L or above 6L.
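
A quick numeric check of this claim, using the arrays computed above (a positive mean residual indicates under-prediction):

# Residuals (observed minus predicted) at the extremes of displacement
resid = cars10_target.values.ravel() - cars10_target_pred.ravel()
x = cars10_feature.values.ravel()
print "Mean residual for EngDispl < 2L: {0}".format(resid[x < 2].mean())
print "Mean residual for EngDispl > 6L: {0}".format(resid[x > 6].mean())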


In [14]:
# calculate root mean square error (RMSE)
from sklearn.cross_validation import cross_val_score

scores = np.sqrt(np.abs(cross_val_score(reg, cars10_feature, cars10_target, cv=10, scoring='mean_squared_error')))
print "RMSE: {0}".format(np.mean(scores))


RMSE: 4.7277409602

Notice that simply re-predicting the training set data is likely to result in an overly optimistic estimate of RMSE. An alternative approach for quantifying how well the model operates is to use resampling techniques, such as the 10-fold cross-validation applied above. We will cover resampling in Chapter 4.
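
For comparison, the optimistic re-substitution estimate can be computed with the rmse helper defined earlier (a sketch, not part of the original run):

# Apparent RMSE from re-predicting the training set -- typically optimistic
print "Apparent RMSE: {0}".format(rmse(cars10_target, reg.predict(cars10_feature)))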

Looking at the previous figure, it is conceivable that the problem might be solved by introducing some non-linearity into the model. The most basic approach is to supplement the simple linear model with additional complexity, e.g. $$y = \beta_0 + \beta_1 x + \beta_2 x^2,$$ which is referred to as a quadratic model.


In [15]:
# quadratic model
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

quad = make_pipeline(PolynomialFeatures(2), LinearRegression())
quad.fit(cars10_feature, cars10_target)

scores = np.sqrt(np.abs(cross_val_score(quad, cars10_feature, cars10_target, cv=10, scoring='mean_squared_error')))
print "RMSE: {0}".format(np.mean(scores))


RMSE: 4.34528462825

The reduction in RMSE suggests that this model is a better fit to the data.


In [16]:
X = np.linspace(np.min(cars10_feature)[0], np.max(cars10_feature)[0])[:, np.newaxis]
y = quad.predict(X)
cars10_target_pred = quad.predict(cars10_feature)
y_range = np.linspace(np.min(cars10_target)[0], np.max(cars10_target)[0])[:, np.newaxis]

fig, (ax1, ax2) = plt.subplots(1, 2)

ax1.scatter(cars10_feature, cars10_target)
ax1.plot(X, y, 'r')
ax1.set_title('2010 Model Year')
ax1.set_xlabel('Engine Displacement')
ax1.set_ylabel('Fuel Efficiency (MPG)')

ax2.scatter(cars10_target, cars10_target_pred)
ax2.plot(y_range, y_range, 'r--')
ax2.set_xlabel('Observed')
ax2.set_ylabel('Predicted')


Out[16]:
[Figure: left, the 2010 training data with the fitted quadratic curve; right, observed vs. predicted MPG]

One issue with quadratic models is that they can perform poorly at the extremes of the predictor. From the figure above, one might suspect that predictions for new vehicles with large displacement values could be inaccurate.
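
One way to see this concretely is to inspect the fitted coefficients and extrapolate a little (a sketch; the named_steps key assumes make_pipeline's default lower-cased step naming):

# PolynomialFeatures(2) produces columns [1, x, x^2], so the last
# coefficient is the quadratic term; if it is positive, the fitted
# curve must eventually turn upward as displacement grows.
print quad.named_steps['linearregression'].coef_
print quad.predict(np.array([[7.0], [8.0], [9.0]]))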

There are other approaches for creating sophisticated relationships between the predictors and the outcome. One such technique is the multivariate adaptive regression spline (MARS) model (Friedman (1991)). When used with a single predictor, MARS can fit separate linear regression lines for different ranges of engine displacement. This model, like many machine learning algorithms, has a tuning parameter that cannot be directly estimated from the data. While the MARS model has internal algorithms for making this determination, the user can try different values and use resampling to determine the appropriate one. Once the value is found, a final MARS model is fit using all the training set data and used for prediction.

The Python module py-earth on GitHub implements MARS and is likely to be merged into sklearn in the near future (see this issue).


In [17]:
# MARS
from pyearth import Earth

mars = Earth()
mars.fit(cars10_feature, cars10_target)
scores = np.sqrt(np.abs(cross_val_score(mars, cars10_feature, cars10_target, cv=10, scoring='mean_squared_error')))
print "RMSE: {0}".format(np.mean(scores))


RMSE: 4.38261607106

The RMSE of MARS is similar to that of the quadratic model.
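
The tuning process described earlier can be sketched with resampling, for instance over py-earth's GCV penalty (the parameter name is an assumption about the py-earth API; other knobs such as max_degree also exist):

# Compare a few candidate penalty values by 10-fold cross-validation
for penalty in [2.0, 3.0, 4.0]:
    model = Earth(penalty=penalty)
    s = np.sqrt(np.abs(cross_val_score(model, cars10_feature, cars10_target,
                                       cv=10, scoring='mean_squared_error')))
    print "penalty = {0}: RMSE = {1}".format(penalty, np.mean(s))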


In [18]:
X = np.linspace(np.min(cars10_feature)[0], np.max(cars10_feature)[0])[:, np.newaxis]
y = mars.predict(X)
cars10_target_pred = mars.predict(cars10_feature)
y_range = np.linspace(np.min(cars10_target)[0], np.max(cars10_target)[0])[:, np.newaxis]

fig, (ax1, ax2) = plt.subplots(1, 2)

ax1.scatter(cars10_feature, cars10_target)
ax1.plot(X, y, 'r')
ax1.set_title('2010 Model Year')
ax1.set_xlabel('Engine Displacement')
ax1.set_ylabel('Fuel Efficiency (MPG)')

ax2.scatter(cars10_target, cars10_target_pred)
ax2.plot(y_range, y_range, 'r--')
ax2.set_xlabel('Observed')
ax2.set_ylabel('Predicted')


Out[18]:
[Figure: left, the 2010 training data with the fitted MARS model; right, observed vs. predicted MPG]

Finally, both the quadratic model and MARS are evaluated on the 2011 test set.


In [19]:
X = np.linspace(np.min(cars11_feature)[0], np.max(cars11_feature)[0])[:, np.newaxis]

fig, (ax1, ax2) = plt.subplots(1, 2)

ax1.scatter(cars11_feature, cars11_target)
ax1.plot(X, quad.predict(X), 'r')
ax1.set_xlabel('Engine Displacement')
ax1.set_ylabel('Fuel Efficiency (MPG)')
ax1.set_title('Quadratic model')

ax2.scatter(cars11_feature, cars11_target)
ax2.plot(X, mars.predict(X), 'r')
ax2.set_xlabel('Engine Displacement')
ax2.set_ylabel('Fuel Efficiency (MPG)')
ax2.set_title('MARS')


Out[19]:
[Figure: the 2011 test set data with the quadratic model fit (left) and the MARS fit (right)]

In [20]:
# RMSE via 10-fold cross-validation on the 2011 data
quad_scores = np.sqrt(np.abs(cross_val_score(quad, cars11_feature, cars11_target, cv=10, scoring='mean_squared_error')))
mars_scores = np.sqrt(np.abs(cross_val_score(mars, cars11_feature, cars11_target, cv=10, scoring='mean_squared_error')))

print "Quadratic model RMSE: {0} and MARS RMSE: {1}".format(np.mean(quad_scores), np.mean(mars_scores))


Quadratic model RMSE: 4.79365661288 and MARS RMSE: 4.83734524679

The first thing to notice is that the two scores are very similar, which indicates that either model is appropriate for this task. Also, both scores are somewhat higher than their cross-validation counterparts on the 2010 data, as we would expect when moving to a new model year.
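
Note that cross_val_score refits each model on folds of the 2011 data. To score the models that were actually fit to the 2010 training data on the 2011 test set, which better matches the train/test split described earlier, one could reuse the rmse helper (a sketch, not part of the original run):

# Evaluate the 2010-fit models directly on the 2011 test set
print "Quadratic test RMSE: {0}".format(rmse(cars11_target, quad.predict(cars11_feature)))
print "MARS test RMSE: {0}".format(rmse(cars11_target, mars.predict(cars11_feature)))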

2.2 Themes

There are several aspects of the model building process that are worth discussing further.

Data Splitting

  • How do we allocate data to certain tasks, e.g. model building and evaluating performance?
    • extrapolation: the order of the data matters (e.g. train on 2010, test on 2011)
    • interpolation: a simple random sample of the data (see the sketch below)
  • How much data should be allocated to the training and test sets?
    • small data sets: resampling techniques, i.e. no separate test set
    • large data sets: a dedicated test set
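
For the interpolation case, a random split is easy to produce with scikit-learn (a minimal sketch; in later versions train_test_split lives in sklearn.model_selection):

from sklearn.cross_validation import train_test_split

# Hold out 25% of the 2010 data as a test set via a simple random sample
X_train, X_test, y_train, y_test = train_test_split(
    cars10_feature, cars10_target, test_size=0.25, random_state=0)
print X_train.shape, X_test.shape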

Predictor Data

Feature selection: the process of determining the minimum set of relevant predictors needed by the model.

Estimating Performance

  • Quantitative assessments of statistics (using resampling techniques)
  • Visualization

Evaluating Several Models

"No Free Lunch" Theorem - Try a wide variety of techniques then determine which model to focus on.

Model Selection

  • between models
  • within the same model

Rely on cross-validation and the test set to produce quantitative assessments of the models and guide the choice.

2.3 Summary

To get a reliable, trustworthy model for predicting new samples, we must first understand the data and the objective of the modeling.